Special Topics in R
Module 5.1: fixest
All materials can be found at alexcardazzi.github.io.
Measurement Error
Every model we’ve created has included an error term. Error terms are, effectively, all the other things we cannot observe that still affect the outcome, as well as pure randomness.
Sometimes in social science, business, etc., we are unable to measure our variables perfectly; we call the resulting discrepancy measurement error. For example, the unemployment rate is based on surveys, which we know have inherent error.
What are the effects of measurement error?
Measurement Error in X
Measurement error in exogenous variables biases our coefficients towards zero. Note the difference between this and omitted variable bias. Omitted variable bias pushes coefficients in a specific direction, positive or negative. If the omitted variable bias is negative, true negative coefficients become more negative and true positive coefficients become less positive1.
However, with measurement error, coefficients get squished down towards zero, not pushed up or down.
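To make this concrete, here is a small base-R simulation (a sketch; the seed, sample size, and noise level are illustrative, not taken from the figures below). With \(X\) having variance 1 and measurement noise having variance 4, classical errors-in-variables theory predicts the estimated slope shrinks to about \(\frac{var(X)}{var(X) + 4} = 0.2\) of its true value of 1.

```r
set.seed(42)
n <- 5000
x_true <- rnorm(n)                    # true X, variance 1
y <- 10 + 1 * x_true + rnorm(n)       # true slope is 1

# Slope using correctly measured X
b_clean <- coef(lm(y ~ x_true))["x_true"]

# Add measurement error (variance 4) to X and re-estimate
x_noisy <- x_true + rnorm(n, sd = 2)
b_noisy <- coef(lm(y ~ x_noisy))["x_noisy"]

# b_clean lands near 1; b_noisy is attenuated toward 0 (around 0.2)
c(clean = b_clean, noisy = b_noisy)
```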
Measurement Error in X Visual
Notice in the visual above how, as the points in X spread out (i.e., as measurement error increases), the slope of the line becomes shallower. The shallowest line has a slope of zero.
Below is a figure plotting the estimated regression coefficient each time we increase the measurement error.
You can see that the coefficient nose-dives toward zero as measurement error of \(X\) increases.2
Measurement Error in Y
The (random) measurement error in \(Y\) gets absorbed into the error term, so in reality, this just makes inference more difficult (by increasing standard errors). The following plot demonstrates the effect of increasing measurement error in Y on coefficient estimates.
Plot
Similarly, the figure below shows the effect of measurement error in \(Y\) on standard errors.
Plot
In summary, measurement error in \(X\) variables will bias coefficients towards zero. Measurement error in \(Y\) does not bias your coefficients, but does inflate their standard errors.
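A parallel sketch for measurement error in \(Y\) (again with illustrative numbers, not the data behind the plots above): the slope estimate stays near its true value, but its standard error grows with the noise.

```r
set.seed(7)
n <- 5000
x <- rnorm(n)
y <- 10 + 1 * x + rnorm(n)            # true slope is 1

m_clean <- summary(lm(y ~ x))

# Add measurement error to Y and re-estimate
y_noisy <- y + rnorm(n, sd = 3)
m_noisy <- summary(lm(y_noisy ~ x))

# Both slopes are near 1, but the residual sd jumps from about 1
# to about sqrt(1 + 9), so the SE is roughly 3.2 times larger
coef(m_clean)["x", c("Estimate", "Std. Error")]
coef(m_noisy)["x", c("Estimate", "Std. Error")]
```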
Standard Errors
When we’re first learning econometrics, much is made of estimating the \(\beta\)’s. Make no mistake, this is incredibly important. However, as you learn more econometrics, you begin to realize that standard errors are just as important as, if not more important than, the coefficients.
Ultimately, we are using econometrics to build models and test hypotheses. We can build the best model in the world, but if our standard errors are wrong, we’ll draw false conclusions.
We briefly touched on the word heteroskedasticity in previous modules. However, we did not ever really discuss its implications.
We assume that the errors resulting from the model are homoskedastic in order to use “normal”3 standard errors, whose formula is justified by the CLT.
So, what happens if we have heteroskedasticity, which is a violation of this assumption?
We are going to simulate data using the following equation:
\[Y_i = 10 + 1\times X_i + e_i\]
Each time we simulate data, however, we are going to force different distributions onto \(e_i\). Below, I plot the simulated \(X\) and \(Y\) data with different error distributions.
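The exact error distributions behind the figures are not reproduced here, but a homoskedastic process and a “fan” shaped heteroskedastic process (error spread growing with \(X\)) can be simulated along these lines:

```r
set.seed(10)
n <- 500
x <- runif(n, 0, 10)

# Homoskedastic: constant error variance
y_homo <- 10 + 1 * x + rnorm(n, sd = 3)

# "Fan" shaped heteroskedastic: error spread grows with X
y_fan <- 10 + 1 * x + rnorm(n, sd = x)

fit_homo <- lm(y_homo ~ x)
fit_fan  <- lm(y_fan ~ x)
# Both slope estimates land near the true value of 1; the difference
# shows up in the residuals and the standard errors, not the coefficients
```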
Plot
Next, we can visualize the relationship between \(X\) and each regression’s residuals.
Plot
Examining the parameter estimates resulting from these data:
Table of Regression Estimates and Standard Errors
|  | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| X | 0.946*** | 0.805*** | 0.981*** | 0.884*** | 0.997*** |
|  | (0.047) | (0.137) | (0.046) | (0.181) | (0.041) |
| Num.Obs. | 500 | 500 | 500 | 500 | 500 |
| R2 | 0.454 | 0.065 | 0.483 | 0.046 | 0.548 |
- Column (1) contains estimates from a homoskedastic error process. The standard errors are relatively small and the coefficient is close to 1, the true value of beta.
- Column (2) contains estimates from a “fan” shaped, heteroskedastic error process. The standard error is about three times as large as the homoskedastic error process.
- Columns (3) and (5) appear relatively similar to column (1).
- Column (4) appears similar to column (2) in both coefficient and standard error.
In general, our coefficients are similar, but our standard errors differ quite a bit.
To address this issue, we can use heteroskedasticity-robust standard errors. Essentially, without getting too deep into the math, we are going to correct for the non-constant variance of the errors. This will give us better standard errors for testing our hypotheses!
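Under the hood, the correction swaps the usual \(\sigma^2(X'X)^{-1}\) variance formula for the White “sandwich” estimator, which weights each observation by its own squared residual. Here is a base-R sketch (the data-generating process is illustrative):

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- 10 + 1 * x + rnorm(n, sd = x)    # fan-shaped heteroskedasticity

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- resid(fit)

bread <- solve(crossprod(X))          # (X'X)^-1
meat  <- crossprod(X * e)             # X' diag(e^2) X
vc_hc0 <- bread %*% meat %*% bread    # HC0 sandwich
vc_hc1 <- vc_hc0 * n / (n - ncol(X))  # HC1 small-sample correction

se_iid    <- sqrt(diag(vcov(fit)))    # "normal" (iid) SEs
se_robust <- sqrt(diag(vc_hc1))       # heteroskedasticity-robust SEs
rbind(iid = se_iid, robust = se_robust)
```

With this fan-shaped process, the robust standard error on the slope comes out larger than the iid one, which is the correction doing its job.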
To apply this correction, we are going to use an R package called fixest.
At this point, the main benefits of this package are twofold. First, fixed effects are estimated extremely quickly. We’ll do some “speed tests” in just a bit. Second, we can apply different standard error corrections to our models.
feols
fixest comes with a few different functions, but we will mostly be interested in feols() which stands for “fixed effects ordinary least squares”. The important arguments into this function are:
- `data`: the data for the regressions.
- `subset`: a vector that tells R which observations out of `data` to use.
- `split`/`fsplit`: sometimes we want to estimate regressions on different subsamples of data. For example, in an early module we estimated the effect of GPA on SAT scores by gender. `split` will automatically estimate the model separately for each unique value of the vector passed to it. `fsplit` will do the same, except it will also estimate a model on the full sample.
- `vcov`: this is how you will specify the variance-covariance matrix (or, in English, the standard errors). For our purposes, “normal” standard errors are `"iid"` and heteroskedasticity-robust standard errors will be `"hetero"`. Note: `feols` defaults to `"cluster"` when using fixed effects, so you will need to manually override this with one of the two options above.
- `fml`: this is the formula to be estimated. For the purposes of this course, `fml` will look like the following: `c(y1, y2, y3) ~ x1 + x2 + x3*x4 | FE1 + FE2`, where `c(y1, y2, y3)` are the names of the outcome variables. `feols` can estimate models with multiple outcomes at once! `x` variables go after the squiggle `~`, and fixed effects follow the vertical bar `|`.
An example4 of feols and fsplit
Code
library("modelsummary")
library("fixest")
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]
r <- feols(sat_sum ~ hs_gpa, df, fsplit = ~sex)
# Note that r is now a "list" of three regressions.
regz <- list(`All Students` = r[[1]],
`Female` = r[[2]],
`Male` = r[[3]])
coefz <- c("hs_gpa" = "High School GPA",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of GPA on SAT Scores",
estimate = "{estimate}{stars}",
coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

|  | All Students | Female | Male |
|---|---|---|---|
| High School GPA | 11.538*** | 11.392*** | 13.715*** |
|  | (0.753) | (0.999) | (1.079) |
| Constant | 66.468*** | 70.196*** | 55.851*** |
|  | (2.440) | (3.167) | (3.579) |
| Num.Obs. | 999 | 515 | 484 |
| R2 | 0.191 | 0.202 | 0.251 |
For a speed comparison, we are going to simulate data of varying sizes with a proportional number of fixed effects (e.g. 1,000 observations and 10 FEs; 10,000 observations and 100 FEs) and see how long `lm()` vs `feols()` take.
Code
lm_time <- c()
feols_time <- c()
nz <- c(1000, 5000, 10000, 20000, 35000, 50000, 75000, 100000)
for(n in nz){
x <- rnorm(n)
fe <- sample(1:(n/100), n, TRUE)
y <- x + rnorm(n)
df <- data.frame(y = y, x = x, fe = fe)
lm_time <- c(lm_time, system.time(lm(y ~ x + as.factor(fe), data = df))[3])
feols_time <- c(feols_time, system.time(feols(y ~ x | fe, df))[3])
}
plot(nz, lm_time, pch = 19, col = "tomato", type = "b",
xlab = "Sample Size", ylab = "Time in Seconds", las = 1)
lines(nz, feols_time, pch = 19, col = "dodgerblue", type = "b")
legend("topleft", legend = c("lm()", "feols()"), bty = "n",
col = c("tomato", "dodgerblue"), lty = 1)
cat("Maximum lm() time:", max(lm_time), "\n")
cat("Maximum feols() time:", max(feols_time))
Output
Maximum lm() time: 102.26
Maximum feols() time: 0.02
Plot
The following table contains coefficient estimates and analytical standard errors from before.
Regression Table
|  | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| X | 0.946*** | 0.805*** | 0.981*** | 0.884*** | 0.997*** |
|  | (0.047) | (0.137) | (0.046) | (0.181) | (0.041) |
| Num.Obs. | 500 | 500 | 500 | 500 | 500 |
| R2 | 0.454 | 0.065 | 0.483 | 0.046 | 0.548 |
This next table adjusts the standard errors to account for possible heteroskedasticity.
Regression Table
|  | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| X | 0.946*** | 0.805*** | 0.981*** | 0.884*** | 0.997*** |
|  | (0.047) | (0.157) | (0.076) | (0.048) | (0.055) |
| Num.Obs. | 500 | 500 | 500 | 500 | 500 |
| R2 | 0.454 | 0.065 | 0.483 | 0.046 | 0.548 |
In some cases, the standard errors increased; in others, they decreased. So what? When the corrected standard errors are larger, the uncorrected ones would have led us to reject the null hypothesis too often, making relationships seem stronger than they really are. The reverse holds when the corrected standard errors are smaller.
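In fixest, switching between these two sets of standard errors is just the `vcov` argument. A minimal sketch with simulated, illustrative data (not the exact data behind the tables above):

```r
library("fixest")

set.seed(1)
df <- data.frame(x = runif(500, 0, 10))
df$y <- 10 + df$x + rnorm(500, sd = df$x)  # fan-shaped heteroskedasticity

r_iid    <- feols(y ~ x, df, vcov = "iid")     # "normal" SEs
r_hetero <- feols(y ~ x, df, vcov = "hetero")  # robust SEs

# SEs can also be swapped after estimation, without re-fitting:
summary(r_iid, vcov = "hetero")
```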
Footnotes
True positive coefficients can even become negative altogether. See the bedrooms and square footage example from Module 4.↩︎
Note: the true parameter value is 1.↩︎
Normal standard errors are called analytical or iid (independently and identically distributed) standard errors.↩︎
This example comes directly from the beginning of Module 4.2↩︎